1 Know your data

  1. Data Literacy
  2. Data types
  3. Visual Vocabulary

1.1 What is Data Visualization?

Data visualization delivers a message from your data. It is like telling a story with charts or data applications. Sometimes the data are too large, or the story is too long, to tell in words. Visualization makes it possible to comprehend huge amounts of data: the important information from more than a million measurements can be available at a glance.

Visualization often enables problems with the data to become immediately apparent. A visualization commonly reveals things not only about the data itself but also about the way it is collected. With an appropriate visualization, errors and artifacts in the data often jump out at you. For this reason, visualizations can be invaluable in quality control.
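
As a minimal sketch of how a plot can expose a data problem, the example below simulates 100 measurements and plants a sentinel value (999) standing in for a missing-data code that was entered as a real value. The variable name, the sentinel, and the cutoff of 250 are assumptions made for illustration; they do not come from any dataset used in this class.

```r
# Simulated measurements (hypothetical data, for illustration only)
set.seed(42)
height <- rnorm(100, mean = 170, sd = 8)

# Plant an artifact: a missing-value code entered as if it were a real value
height[17] <- 999

# The summary hints at a problem; the plot makes it obvious
summary(height)
plot(height, main = "One point sits far from the rest")

# Flag suspicious values for follow-up (the cutoff is an arbitrary choice)
suspect <- which(height > 250)
suspect
```

In the scatterplot the planted value sits far above every other point, which is the kind of artifact that is easy to miss in a table of numbers.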

Visualization facilitates understanding of both large-scale and small-scale features of the data. It can be especially valuable in allowing the perception of patterns linking local features.

Visualization facilitates hypothesis formation, inviting further inquiry and theory building (Colin Ware 2012). It is a core part of exploratory data analysis (EDA), but it can also provide tools for hypothesis confirmation.

1.2 Learn to read data

Edward Tufte was among the first to emphasize visual thinking in working with data. In his words:

“If you’re not doing something different, you’re not doing anything at all.” - Edward Tufte

“It comes with the package, to reconcile [yourself] to life’s inevitable trade-offs and heartaches.” - Edward Tufte

Figure: Edward Tufte

Figure: Example of a multi-dimensional plot of data

Figure: Example of multiple dimensions of data

1.2.1 Hands-on workshop: Data programming

2 Data Programming

This session starts with the basic principles of data programming, that is, coding that involves data. Data programming is a practice that works and evolves with data. It allows the user to manage and process data more effectively. Programs are designed to be replicable by the user and by collaborators. A data program can be developed and updated iteratively and incrementally; in other words, it builds on accumulated work without repeating earlier steps. It also takes debugging. Debugging is usually described as identifying problems (bugs), but in practice it also means updating the program for different situations, inputs, and contexts, including the programmer's own use of it in the future.
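
One way to make the idea of a replicable program concrete is to fix the random seed and wrap a step in a function, so that the same call returns the same data for you and for a collaborator. The sketch below is only an illustration; the function name `simulate_points` and the seed value are hypothetical.

```r
# A small, reusable function: simulate n (x, y) pairs of standard normal draws
simulate_points <- function(n, seed) {
  set.seed(seed)  # fixing the seed makes the random draws repeatable
  data.frame(x = rnorm(n), y = rnorm(n))
}

run1 <- simulate_points(50, seed = 2024)
run2 <- simulate_points(50, seed = 2024)

# Same seed, same data
identical(run1, run2)
```

Because the step is a function, it can be extended later (for example, with a different distribution) without rewriting the code that calls it.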

2.1 Data programming in R

R basics

# Create two variables of 50 random draws each using the rnorm function
x <- rnorm(50)
y <- rnorm(length(x))  # same length as x; rnorm(x) would also use length(x)

# Plot the points in the plane
plot(x, y)

2.1.1 Using R packages

# Plot better, using the ggplot2 package
## Prerequisite: install and load the ggplot2 package
## install.packages("ggplot2")
library(ggplot2)
## Note: qplot() is deprecated as of ggplot2 3.4.0; it still runs, with a warning
qplot(x, y)

2.2 Data Visualization with R

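# Plot even better with ggplot2
x <- rnorm(50)
y <- rnorm(length(x))
# Pass the data explicitly as a data frame instead of leaving the data argument empty
ggplot(data.frame(x, y), aes(x, y)) + theme_bw() + geom_point(color = "blue")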

Taiwan Election and Democratization Study 2016 data

The Taiwan Election and Democratization Study (TEDS), which started in 2001, is one of the longest-running and most comprehensive election studies. TEDS collects data through different survey modes, including face-to-face interviews, telephone interviews, and internet surveys. More details on TEDS can be found on the National Chengchi University Election Study Center website at https://esc.nccu.edu.tw/main.php.

# Import the TEDS 2016 data in Stata format using the haven package
##install.packages("haven")

library(haven)
TEDS_2016 <- haven::read_stata("https://github.com/datageneration/home/blob/master/DataProgramming/data/TEDS_2016.dta?raw=true")

# Prepare to analyze the Party ID variable
# Assign label to the values (1=KMT, 2=DPP, 3=NP, 4=PFP, 5=TSU, 6=NPP, 7="NA")

TEDS_2016$PartyID <- factor(TEDS_2016$PartyID, labels=c("KMT","DPP","NP","PFP", "TSU", "NPP","NA"))

Take a look at the variable:

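# Check the variable
# (refer to the column as TEDS_2016$PartyID instead of calling attach(), which can mask other objects)
head(TEDS_2016$PartyID)
## [1] NA  NA  KMT NA  NA  DPP
## Levels: KMT DPP NP PFP TSU NPP NA
tail(TEDS_2016$PartyID)
## [1] NA  NA  DPP NA  NA  NA 
## Levels: KMT DPP NP PFP TSU NPP NA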
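
Note that the NA shown in this output is a category label, not R's missing value. The sketch below illustrates the difference on a small toy vector; the codes are made up for illustration and are not taken from the TEDS data. It also passes `levels = 1:7` explicitly, an optional safeguard so that each label stays tied to its code even if some code never appears in the data.

```r
# Toy codes standing in for a party ID column (hypothetical values)
codes <- c(7, 7, 1, 7, 7, 2, 3, 4, 5, 6)
party <- factor(codes, levels = 1:7,
                labels = c("KMT", "DPP", "NP", "PFP", "TSU", "NPP", "NA"))

levels(party)        # the seven labels, in code order
sum(is.na(party))    # 0: the "NA" label is an ordinary category, not a missing value
sum(party == "NA")   # 4: the observations coded 7
```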

Frequency table:

# Run a frequency table of the Party ID variable using the descr package
## install.packages("descr")
library(descr)
freq(TEDS_2016$PartyID)

## TEDS_2016$PartyID 
##       Frequency  Percent
## KMT         388  22.9586
## DPP         591  34.9704
## NP            3   0.1775
## PFP          32   1.8935
## TSU           5   0.2959
## NPP          43   2.5444
## NA          628  37.1598
## Total      1690 100.0000
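
The Percent column is each count divided by the total number of respondents. As a quick check, the sketch below reproduces it in base R from the counts printed in the table above; the object names are chosen for illustration only.

```r
# Counts copied from the frequency table above
counts <- c(KMT = 388, DPP = 591, NP = 3, PFP = 32, TSU = 5, NPP = 43, "NA" = 628)

total <- sum(counts)                       # number of respondents
percent <- round(100 * counts / total, 4)  # share of each party, in percent

total
percent
```

With the data loaded, `table()` and `prop.table()` applied to `TEDS_2016$PartyID` give the same counts and shares without the descr package.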

Get a better chart of the Party ID variable:

# Plot the Party ID variable
library(ggplot2)
ggplot(TEDS_2016, aes(PartyID)) + 
  geom_bar()

We can refine the chart further, for example by labeling the x and y axes and plotting percentages instead of counts.

# after_stat() replaces the deprecated ..count.. notation (requires ggplot2 >= 3.3.0)
ggplot2::ggplot(TEDS_2016, aes(PartyID)) + 
  geom_bar(aes(y = after_stat(count / sum(count)))) + 
  scale_y_continuous(labels = scales::percent) +
  ylab("Party Support (%)") + 
  xlab("Taiwan Political Parties")

Adding colors, with another theme:

# Fill each bar by party; after_stat() replaces the deprecated ..count.. notation
ggplot2::ggplot(TEDS_2016, aes(PartyID)) + 
  geom_bar(aes(y = after_stat(count / sum(count)), fill = PartyID)) + 
  scale_y_continuous(labels = scales::percent) +
  ylab("Party Support (%)") + 
  xlab("Taiwan Political Parties") +
  theme_bw()

Hold on, colors are not right!

## Better color control
## install.packages("RColorBrewer")
## (scale_fill_manual() takes color names directly, so RColorBrewer is optional here)
library(RColorBrewer)
## Colors are assigned in the order of the factor levels: KMT, DPP, NP, PFP, TSU, NPP, NA
ggplot2::ggplot(TEDS_2016, aes(PartyID)) + 
  geom_bar(aes(y = after_stat(count / sum(count)), fill = PartyID)) + 
  scale_y_continuous(labels = scales::percent) +
  ylab("Party Support (%)") + 
  xlab("Taiwan Political Parties") +
  theme_bw() +
  scale_fill_manual(values = c("steelblue", "forestgreen", "khaki1", "orange", "goldenrod", "yellow", "grey"))
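
An unnamed color vector is matched to the factor levels by position, so reordering or dropping a level can silently change which party gets which color. A possible safeguard, sketched below, is to name each color with its party label, since `scale_fill_manual()` matches named values to the data values. The toy data frame reuses the counts from the frequency table above; the object names `party_colors` and `toy` are hypothetical.

```r
library(ggplot2)

# Party-to-color lookup (same colors as above, now keyed by party label)
party_colors <- c(KMT = "steelblue", DPP = "forestgreen", NP = "khaki1",
                  PFP = "orange", TSU = "goldenrod", NPP = "yellow",
                  "NA" = "grey")

# Toy data built from the counts in the frequency table above
toy <- data.frame(
  PartyID = factor(names(party_colors), levels = names(party_colors)),
  n = c(388, 591, 3, 32, 5, 43, 628)
)

# Colors follow the party names, whatever the order of the bars
p <- ggplot(toy, aes(PartyID, n, fill = PartyID)) +
  geom_col() +
  scale_fill_manual(values = party_colors) +
  theme_bw()
p
```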

To make the chart more meaningful, we can use the tidyverse, a collection of packages for managing data, to compute each party's share and sort the bars by it.

## install.packages("tidyverse")
library(tidyverse)
# Count respondents per party and compute each party's share of the sample
T2 <- TEDS_2016 %>% 
  count(PartyID) %>% 
  mutate(perc = n / nrow(TEDS_2016))
# Order the bars from largest to smallest share and show the y axis as percentages
ggplot2::ggplot(T2, aes(x = reorder(PartyID, -perc), y = perc, fill = PartyID)) + 
  geom_col() +
  scale_y_continuous(labels = scales::percent) +
  ylab("Party Support (%)") + 
  xlab("Taiwan Political Parties") +
  theme_bw() +
  scale_fill_manual(values = c("steelblue", "forestgreen", "khaki1", "orange", "goldenrod", "yellow", "grey"))
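
The call `reorder(PartyID, -perc)` used above sorts the factor levels by increasing value of its second argument, so negating the share puts the largest group first. The base R sketch below shows the effect on the counts from the frequency table; the object names are chosen for illustration.

```r
# Counts from the frequency table, converted to shares
counts <- c(KMT = 388, DPP = 591, NP = 3, PFP = 32, TSU = 5, NPP = 43, "NA" = 628)
party <- factor(names(counts), levels = names(counts))
perc <- unname(counts) / sum(counts)

# Levels sorted by descending share
ordered_party <- reorder(party, -perc)
levels(ordered_party)
```

Only the level order changes, which is what controls the order of the bars on the x axis; the underlying data values stay the same.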

## Customize font

# The font family must be installed on your system; if "Palatino" is not available, try "serif"
ggplot2::ggplot(T2, aes(x = reorder(PartyID, -perc), y = perc, fill = PartyID)) + 
  geom_col() +
  scale_y_continuous(labels = scales::percent) +
  ylab("Party Support (%)") + 
  xlab("Taiwan Political Parties") +
  theme_bw() +
  scale_fill_manual(values = c("steelblue", "forestgreen", "khaki1", "orange", "goldenrod", "yellow", "grey")) +
  theme(text = element_text(size = 14, family = "Palatino"))

## Resize and reposition the legend

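# legend.position = c(x, y) places the legend inside the plot, in relative coordinates from 0 to 1
# (in ggplot2 >= 3.5.0 the preferred form is legend.position = "inside" plus legend.position.inside = c(0.8, 0.7))
ggplot2::ggplot(T2, aes(x = reorder(PartyID, -perc), y = perc, fill = PartyID)) + 
  geom_col() +
  scale_y_continuous(labels = scales::percent) +
  ylab("Party Support (%)") + 
  xlab("Taiwan Political Parties") +
  theme_bw() +
  scale_fill_manual(values = c("steelblue", "forestgreen", "khaki1", "orange", "goldenrod", "yellow", "grey")) +
  theme(text = element_text(size = 14, family = "Palatino"),
        legend.title = element_text(size = 10),
        legend.position = c(0.8, 0.7))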

2.4 References:

Graham Williams. 2011. Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery. Springer.